library(fpp3)
## ── Attaching packages ────────────────────────────────────────────── fpp3 0.5 ──
## ✔ tibble 3.1.8 ✔ tsibble 1.1.3
## ✔ dplyr 1.1.0 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.3.0 ✔ feasts 0.3.0
## ✔ lubridate 1.9.2 ✔ fable 0.3.2
## ✔ ggplot2 3.4.1 ✔ fabletools 0.3.2
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(tsibble)
tsibble objectsy <- tsibble(
Year = 2015:2019,
Observation = c(123, 39, 78, 52, 110),
index = Year
)
y
tsibble introduces temporal structure to tidy data
frames.
A regular tibble can be converted to a
tsibble object
z <- tibble(
Month = c("2019 Jan", "2019 Feb", "2019 Mar", "2019 Apr", "2019 May"),
Observation = c(50, 23, 34, 30, 25)
)
z
z |>
mutate(Month = yearmonth(Month)) |>
as_tsibble(index = Month)
mutate() converts the “Month” column to the appropriate
variable type, in this case yearmonth.
as_tsibble() sets the index for the tsibble
object
| Frequency | Function |
|---|---|
| Annual | start:end |
| Quarterly | yearquarter() |
| Monthly | yearmonth() |
| Weekly | yearweek() |
| Daily | as_date(), ymd() |
| Sub-Daily | as_datetime(), ymd_hms() |
olympic_running
The summary shows that the tsibble object has 312 rows,
4 columns, and the index is in 4 year intervals. Additionally there are
14 seperate time series uniquely identified by the keys
Length and Sex. distinct() can be
used to show categories and combinations of each variable.
olympic_running |> distinct(Sex)
olympic_running |> distinct(Length)
tsibble objectsdplyr functions include
mutate(),filter(),select(),summarise()`
PBS
This data set contains monthly data on Medicare Australia prescription data from July 1991 to June 2008.
select()PBS |>
filter(ATC2 == "A10")
filter()PBS |>
filter(ATC2 == "A10") |>
select(Month, Concession, Type, Cost)
Note that Month, Concession and
Type would be returned automatically to ensure that each
row contains a unique combination of index and keys.
summarise()PBS |>
filter(ATC2 == "A10") |>
select(Month, Concession, Type, Cost) |>
summarise(TotalC = sum(Cost))
mutate()Used to create new variables
PBS |>
filter(ATC2 == "A10") |>
select(Month, Concession, Type, Cost) |>
summarise(TotalC = sum(Cost)) |>
mutate(Cost = TotalC/1e6) # Convert from dollars to millions of dollars
Save the tsibble object
PBS |>
filter(ATC2 == "A10") |>
select(Month, Concession, Type, Cost) |>
summarise(TotalC = sum(Cost)) |>
mutate(Cost = TotalC/1e6) -> a10
tsibbleprison <- readr::read_csv("https://OTexts.com/fpp3/extrafiles/prison_population.csv")
## Rows: 3072 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): State, Gender, Legal, Indigenous
## dbl (1): Count
## date (1): Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
prison
When converting the table to a tsibble object we define
the index and key columns. Note that the data is quarterly but is stored
as individual days.
prison <- prison |>
mutate(Quarter = yearquarter(Date)) |>
select(-Date) |>
as_tsibble(key = c(State, Gender, Legal, Indigenous),
index = Quarter)
prison
Common periods for time intervals
| Data | Minute | Hour | Day | Week | Year |
|---|---|---|---|---|---|
| Quarters | 4 | ||||
| Months | 12 | ||||
| Weeks | 52 | ||||
| Days | 7 | 365.25 | |||
| Hours | 24 | 168 | 8766 | ||
| Minutes | 60 | 1440 | 10080 | 525960 | |
| Seconds | 60 | 3600 | 86400 | 6048000 | 31557600 |
More complicated and unusual seasonal patterns can be specified using
period() in the lubricate package.
ansett
Let’s look at the weekly economy passenger load on Ansett Airlines between their two largest cities.
melsyd_economy <- ansett |>
filter(Airports == "MEL-SYD", Class == "Economy") |>
mutate(Passengers = Passengers/1000)
autoplot(melsyd_economy, Passengers) +
labs(title = "Ansett airlines economy class",
subtitle = "Melbourne-Sydney",
y = "Passengers (,000)")
Note:
A model needs to take all these features into account.
Looking at our simpler time series from 2.1:
PBS |>
filter(ATC2 == "A10") |>
select(Month, Concession, Type, Cost) |>
summarise(TotalC = sum(Cost)) |>
mutate(Cost = TotalC/1e6) -> a10
autoplot(a10, Cost) +
labs(y = "$ (millions)",
title = "Australian antidiabetic drug sales")
This shows a clear and increasing trend.
a10 |>
gg_season(Cost, labels = "both") +
labs(y = "$ (millions)",
title = "Seasonal plot: Antidiabetic drug sales")
Useful for identifying Years with with unusual patterns. In this case we see 2008 dipping lower than usual in March 2008. The low figure in June could be the result of incomplete data.
period can be used for data with multiple seasonal
patterns by specifying daily, weekly or yearly patterns
vic_elec
vic_elec |> gg_season(Demand, period = "day") +
theme(legend.position = "none") +
labs(y = "Mwh", title = "Electricity demain in Victoria")
vic_elec |> gg_season(Demand, period = "week") +
theme(legend.position = "none") +
labs(y = "Mwh", title = "Electricity demain in Victoria")
vic_elec |> gg_season(Demand, period = "year") +
theme(legend.position = "none") +
labs(y = "Mwh", title = "Electricity demain in Victoria")
a10 |>
gg_subseries(Cost) +
labs(
y = "$ (million)",
title = "Australian antidibetic drug sales"
)
This plot collects each season into seperate mini time plots. The blue line indicates the mean. This plot is useful to clearly view the underlying seasonal pattern and shows the changes in seasonality over time. In this case it is not very interesting.
tourism
Lets consider the total visitor nights on holiday for each quarter.
holidays <- tourism |>
filter(Purpose == "Holiday") |>
group_by(State) |>
summarise(Trips = sum(Trips))
holidays
autoplot(holidays, Trips) +
labs(y = "Overnight trips",
title = "Australian domestic holidays")
The plot shows strong seasonality for most states but the peaks do
not necessarily coincide.To see the timing of those seasonal peaks use a
gg_season() plot.
gg_season(holidays, Trips) +
labs(y = "Overnight trips ('000)",
title = "Austrlian domestic holidays")
And a gg_subseries() plot
gg_subseries(holidays, Trips) +
labs(y = "Overnight trips ('000)",
title = "Austrlian domestic holidays")
Here we see the increase in Western Australia in recent years. Also Victoria’s strong increas in Q1 and Q4 but not in the other quarters.
The above plots are used to explore individual time series. Scatterplots can explore the relationship between two time series.
The following shows half-hourly electricity demand and temperature for 2014 in Victoria Australia.
vic_elec |>
filter(year(Time) == 2014) |>
autoplot(Demand) +
labs(y = "GW",
title = "Half-hourly electricity demand: Victoria")
vic_elec |>
filter(year(Time) == 2014) |>
autoplot(Temperature) +
labs(
y = "Degrees Celsius",
title = "Half-hourly temperatures: Melbourne, Australia"
)
vic_elec |>
filter(year(Time) == 2014) |>
ggplot(aes(x = Temperature, y = Demand)) +
geom_point() +
labs(x = "Temperature (degrees Celsius)",
y = "Electricity demand")
This shows a high demand when temperatures are high as well as when they are low.
The correlation coefficients measure the strength of the linear relationship between two variables.
\[ r = \frac{\sum (x_{t} - \bar{x})(y_{t}-\bar{y})}{\sqrt{\sum(x_{t}-\bar{x})^2}\sqrt{\sum(y_{t}-\bar{y})^2}} \]
for \(r\) between -1 and 1.
The correlation coefficient only measures the strenght of the linear relationship.
These plots all have correlation coefficients of 0.82 but have very different relationships. Plots are important. Do not rely on correlation values alone.
A useful starting point is to plot each variable agains each other variable.
visitors <- tourism |>
group_by(State) |>
summarise(Trips = sum(Trips))
visitors |>
ggplot(aes(x = Quarter, y = Trips)) +
geom_line() +
facet_grid(vars(State), scales = "free_y") +
labs(title = "Australian domestic tourism",
y= "Overnight trips ('000)")
visitors |>
pivot_wider(values_from=Trips, names_from=State) |>
GGally::ggpairs(columns = 2:9)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
recent_production <- aus_production |>
filter(year(Quarter) >= 2000)
recent_production |>
gg_lag(Beer, geom = "point") +
labs(x = "lag(Beer, k)")
Note the strong positive relationship at lags 4 and 8 reflecting seasonality. The strong negative correlations in lags 2 and 6 reflect peaks in Q4 being plotted against troughs in Q2.
Autocorrelation measures the linear relationship between lagged values of a time series.
There are several autocorrelation coefficients corresponding to each panel in the lag plots. Eg. \(r_1\) measures the relationship between \(y_t\) and \(y_{t-1}\), \(r_2\) for \(y_t\) and \(y_{t-2}\), etc.
\[ r_{k} = \frac{\sum\limits_{t=k+1}^T (y_{t}-\bar{y})(y_{t-k}-\bar{y})} {\sum\limits_{t=1}^T (y_{t}-\bar{y})^2} \]
where \(T\) is the length of the
time series. The autocorrelation function
ACF() can be used to compute the coefficients.
recent_production |> ACF(Beer, lag_max = 9)
The acf column contains \(r_1, r_2,\dots,r_9\). We can plot these with a correlogram.
recent_production |>
ACF(Beer) |>
autoplot() + labs(title = "Australian beer production")
For data with a trend there is strong autocorrelation for small lags because observations near in time are also near in value. Therefore the ACF will ususally show positive values which slowly decrease as the lags increase.
When data are seasonal the autocorrelation will be larger for seasonal lags (multiples of the seasonal period).
a10 is an example of a time series with both seasonal
and trend components.
a10 |>
ACF(Cost, lag_max = 48) |>
autoplot() +
labs(title = "Australian antidiabetic drug sales")
White noise refers to data with no autocorrelation
set.seed(30)
y <- tsibble(sample = 1:50,
wn = rnorm(50),
index = sample)
y |> autoplot(wn) + labs(title = "White noise", y = "")
y |>
ACF(wn) |>
autoplot() + labs(title = "White noise")
For white noise each autocorrelation value should be close to zero. More specifically 95% of the spikes should lie within \(\pm 2/\sqrt{T}\) where \(T\) is the length of the time series. If one or more large spikes or more than 5% of spikes lie outside the 95% bounds it is probably not pure white noise.